PhytoOracle Phenomics Data Processing Pipelines

From Field Preparation to Phenotype Information

Emmanuel Miguel Gonzalez emmanuelgonz.github.io (School of Plant Sciences, University of Arizona, Tucson, AZ)
2023-12-21

Field Preparation

The following steps must be completed prior to planting:

  1. Shape raised beds
  2. Set up sprinkler irrigation pipes, heads, gaskets, and filters
  3. Inject subsurface drip irrigation tape
  4. Place string and labeled stakes in the field

Note: These steps are carried out by Pauli lab members a few weeks before planting.

After completing these steps, the field will look like Figure 1.

Figure 1: South gantry field with shaped raised beds, sprinkler irrigation, and strings and stakes.

Planting

Lettuce planting generally occurs from mid-November to early December. The mean air temperature during this time has previously ranged from 10 °C to 22 °C (Figure 2).

Figure 2: Mean air temperature data collected by The Arizona Meteorological Network (AZMET). Orange vertical lines represent the day of planting. DOP, day of planting; S13, season 13; S15, season 15; S17, season 17.

Equipment

Planting is done by hand using Earthway garden seed planters (Figure 3 Left). Lettuce seeds must be planted at a depth of 1/8 to 1/4 inch. The planting depth is set using the adjustable screw at the bottom of the seed planter. Check the depth throughout planting, as the screw can shift. Also make sure that the chain at the bottom of the planter is not tangled, as it is meant to drag soil over the seeds after the planter opens the seed line. If the chain is tangled, seeds will not be covered with soil and may therefore fail to germinate or be blown or washed away.

The Earthway planters were modified by fitting them with funnels and tubing that allow the user to hand-feed the small lettuce seeds instead of using the provided seed container and plates. Planting is carried out by members of the Pauli, Arnold, and Michelmore labs. People work in pairs: one person plants the seeds with the Earthway planter, and the other ensures that the correct plot numbers are being planted and hands the correct seed to the person planting (Figure 3 Right).

Figure 3: Lettuce hand planting. (Left) Earthway garden seed planter. (Right) One person planting using the Earthway planter, while the other is responsible for ensuring correct plot numbers and handing the correct seed to the person planting.

Potential Issues During Planting

In past years, the tubing that feeds the seeds into the ground has gotten pinched or otherwise clogged. In these cases, entire columns were inadequately planted because the seed did not make it into the seed line of the expected plot. When this happens, Drs. Duke Pauli and Maria José Truco are notified, and the plots within the affected column or columns are noted. If seed is not immediately available, Dr. Maria José Truco sends it from Davis, California.

Ground Control Points

The raw data collected by the Field Scanalyzer has a high level of misalignment of images and point clouds. To mitigate this error, a high number of ground control points (GCPs) are placed in the field. These GCPs include (Figure 4):

  - White plastic bucket lids, four columns into the field on both east and west ends
  - Umbrella holders with grey metal bucket lids, trench between four and five columns into the field on both east and west ends

Figure 4: Ground control points (GCPs) used in the gantry field. (Left) White plastic bucket lid. (Right) Umbrella holder with grey metal bucket lid.

Each range contains a single white plastic bucket lid and two umbrella holders with grey metal bucket lids in the following arrangement (Figure 5):

Figure 5: Arrangement of ground control points (GCPs) in the gantry field. (Left) Each range contains a single white plastic bucket lid and two umbrella holders with grey metal bucket lids. (Right) White plastic bucket lids are alternated, to ensure robust geocorrection.

Thinning

Thinning is a very important part of the field trial. The planters often drop seeds in clusters, so seedlings germinate close to each other. Thinning is conducted in two phases (Figure 6):

Figure 6: Change in plant density after multiple rounds of thinning. (Left) Plants after Phase 1 of thinning. (Right) Plants after Phase 2 of thinning.

The 10 individual plants resulting from Phase 2 should be equidistant. Equidistant placement reduces overlap with neighboring plants. This is an important step because the goal with the Field Scanalyzer data is to phenotype each plant individually. The farther apart the plants are, the easier it is to phenotype them individually.

Positioning Information Preparation

The Global Positioning System (GPS) coordinates of each GCP must be collected so they can be used in PhytoOracle workflows. To accomplish this, you need a Trimble Global Navigation Satellite System (GNSS) (Figure 7).

Figure 7: Trimble Global Navigation Satellite System (GNSS) used to collect accurate Global Positioning System (GPS) coordinates of Ground Control Points (GCPs).

Collecting Global Positioning System (GPS) Coordinates

The United States Department of Agriculture (USDA) Arid Land Agricultural Research Center (ALARC) has Trimble units that we can borrow. To use them, follow the steps below:

  1. Run Trimble Access - Press Trimble hard key (Windows symbol), select Trimble Access
  2. Log in - Click either “Tap here to log in” or the current logged in person (e.g., kelly.thorp)
  3. Set up a job - Click General Survey -> Jobs
  4. To measure points
  5. To stake flags at point locations

Positioning Information Files Required by PhytoOracle

PhytoOracle relies on geospatial information, such as GPS coordinates, to accurately link phenotypes with a location in the field. This allows us to detect, tag, and track individual plants over the course of multiple Field Scanalyzer scans. Specifically, PhytoOracle requires two files:

  1. GCP file: Text file containing the GPS coordinates of all field GCPs.
  2. GeoJSON: File containing polygons representing each plot in the gantry field

These files must be generated prior to data processing for the respective season. Additionally, these files should be loaded onto QGIS for visual inspection and confirmation that the coordinates are accurate.

Generating GCP File

The Trimble collects GPS coordinates in Easting, Northing format (Table 1). PhytoOracle requires GPS coordinates in latitude, longitude format. To convert the coordinates, use the conversion tool in the gcp_coordinates_conversion repository. After running the conversion script, the data will be in the required latitude, longitude format (Table 2).

Table 1: Ground Control Point (GCP) coordinate file. Each row represents the coordinate of a single GCP.

GCP | Type | Northing | Easting | Height (m)
plate1 | White | 3659979 | 408992.8 | 360.775
plate2 | White | 3659987 | 408992.9 | 360.788
plate3 | White | 3659995 | 408992.9 | 360.783
plate4 | White | 3660003 | 408993.0 | 360.770
plate5 | White | 3660011 | 408993.1 | 360.775
plate6 | White | 3660019 | 408993.1 | 360.765

Table 2: Ground Control Point (GCP) coordinate file. Each row represents the coordinate of a single GCP.

GCP number | Latitude | Longitude
1 | 33.07470 | -111.975
2 | 33.07478 | -111.975
3 | 33.07485 | -111.975
4 | 33.07492 | -111.975
5 | 33.07499 | -111.975
6 | 33.07506 | -111.975
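
For readers who want to sanity-check converted values, the conversion from Easting, Northing to latitude, longitude can be sketched with the standard inverse Transverse Mercator series. This is not the gcp_coordinates_conversion tool itself; it is a minimal standard-library sketch that assumes the Trimble output is UTM on the WGS84 ellipsoid, northern hemisphere, zone 12. The zone and datum are assumptions chosen because they fit the example coordinates, and should be confirmed against the Trimble job settings.

```python
import math

# WGS84 ellipsoid and UTM constants.
A_AXIS = 6378137.0               # semi-major axis (m)
FLATTENING = 1 / 298.257223563
K0 = 0.9996                      # UTM scale factor on the central meridian
FALSE_EASTING = 500000.0         # metres


def utm_to_latlon(easting, northing, zone=12):
    """Convert northern-hemisphere UTM coordinates to (lat, lon) in degrees.

    zone=12 is an assumption that fits the example coordinates; confirm the
    zone and datum configured on the Trimble before relying on it.
    """
    n = FLATTENING / (2 - FLATTENING)
    radius = A_AXIS / (1 + n) * (1 + n**2 / 4 + n**4 / 64)

    xi = northing / (K0 * radius)
    eta = (easting - FALSE_EASTING) / (K0 * radius)

    # Series coefficients (third order in n) for the inverse projection.
    beta = (
        n / 2 - 2 * n**2 / 3 + 37 * n**3 / 96,
        n**2 / 48 + n**3 / 15,
        17 * n**3 / 480,
    )
    delta = (
        2 * n - 2 * n**2 / 3 - 2 * n**3,
        7 * n**2 / 3 - 8 * n**3 / 5,
        56 * n**3 / 15,
    )

    xi_p = xi - sum(b * math.sin(2 * j * xi) * math.cosh(2 * j * eta)
                    for j, b in enumerate(beta, start=1))
    eta_p = eta - sum(b * math.cos(2 * j * xi) * math.sinh(2 * j * eta)
                      for j, b in enumerate(beta, start=1))

    chi = math.asin(math.sin(xi_p) / math.cosh(eta_p))
    lat = chi + sum(d * math.sin(2 * j * chi)
                    for j, d in enumerate(delta, start=1))
    lon0 = math.radians(zone * 6 - 183)  # central meridian of the zone
    lon = lon0 + math.atan2(math.sinh(eta_p), math.cos(xi_p))
    return math.degrees(lat), math.degrees(lon)


# First row of Table 1 (plate1): Easting 408992.8, Northing 3659979.
lat, lon = utm_to_latlon(408992.8, 3659979)
print(f"{lat:.5f} {lon:.3f}")
```

Because the Northing values in Table 1 are shown rounded, the result should agree with the first row of Table 2 only to about four decimal places of latitude.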

Generating GeoJSON File

GeoJSON files contain polygons that represent each plot in the gantry field (Figure 8). These polygons are used to extract smaller experimental units from larger units, such as the full field scale.

Figure 8: GeoJSON file containing a single polygon for each plot.
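
To make the file structure concrete, the sketch below builds a small GeoJSON FeatureCollection with one rectangular polygon per plot using only the standard library. The "genotype" property follows the column name used in this guide; the "plot" property name, the helper function, and all coordinates are hypothetical and only illustrate the layout.

```python
import json


def plot_polygon(plot, genotype, west, south, east, north):
    """Return one GeoJSON Feature: a rectangular plot with its attributes."""
    ring = [
        [west, south], [east, south], [east, north], [west, north],
        [west, south],  # GeoJSON rings must close on the first vertex
    ]
    return {
        "type": "Feature",
        "properties": {"plot": plot, "genotype": genotype},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }


# Hypothetical plots and coordinates (GeoJSON uses longitude, latitude order).
features = [
    plot_polygon(1, "genotype_A", -111.97500, 33.07470, -111.97498, 33.07474),
    plot_polygon(2, "genotype_B", -111.97500, 33.07474, -111.97498, 33.07478),
]
collection = {"type": "FeatureCollection", "features": features}
geojson_text = json.dumps(collection, indent=2)
```

Writing `geojson_text` to a file with a `.geojson` extension gives a layer that can be opened in QGIS for visual inspection.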

Our field design and dimensions remain largely consistent from one season to the next. As a result, existing GeoJSONs are modified and applied to new seasons. If a new GeoJSON needs to be created, please refer to FIELDimageR.

If you are editing a pre-existing GeoJSON, you will need to:

  1. Move polygons that are misaligned in the new season
  2. Rename genotype column

Moving polygons

To move polygons, you need to load the GeoJSON and a drone orthomosaic onto QGIS. Then, you can follow the steps in Figure 9:

  1. Click “Toggle Editing”
  2. Click “Select Features by Area or Single Click”
  3. Click “Move Features”
  4. Manually move polygon into desired alignment
  5. Single click to drop the polygon into the desired location
  6. Save changes

Figure 9: Editing GeoJSON polygons using QGIS.

Renaming genotype column

The “genotype” values in the GeoJSON file can be edited using GeoPandas. A GeoJSON can be opened as a dataframe, similar to Pandas. Once it is opened, you can replace the values in the “genotype” column using the fieldbook for the respective season. To see an example click here.
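
As a dependency-free illustration of the same idea, the sketch below replaces the "genotype" property of each feature from a plot-to-genotype mapping using only the `json` module. It is not the linked example: the "plot" property name, the `update_genotypes` helper, and the fieldbook contents are hypothetical. With GeoPandas, the equivalent would be to read the file with `geopandas.read_file`, overwrite the column, and write it back with `GeoDataFrame.to_file`.

```python
import json


def update_genotypes(collection, fieldbook, plot_key="plot"):
    """Overwrite each feature's "genotype" from a plot -> genotype mapping.

    Returns the plot identifiers that were missing from the fieldbook so
    they can be checked by hand instead of being silently skipped.
    """
    missing = []
    for feature in collection["features"]:
        props = feature["properties"]
        plot = props.get(plot_key)
        if plot in fieldbook:
            props["genotype"] = fieldbook[plot]
        else:
            missing.append(plot)
    return missing


# Hypothetical two-plot GeoJSON carried over from a previous season.
old = json.loads("""
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"plot": 1, "genotype": "old_A"}, "geometry": null},
  {"type": "Feature", "properties": {"plot": 2, "genotype": "old_B"}, "geometry": null}
]}
""")
fieldbook = {1: "new_A"}  # plot 2 is deliberately absent
missing = update_genotypes(old, fieldbook)
```

Returning the unmatched plots rather than raising makes it easy to review every mismatch between the GeoJSON and the fieldbook in one pass.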

Pipeline Preparation

The PhytoOracle (PO) pipelines require the aforementioned GCP and GeoJSON files. Additionally, a YAML (YAML Ain't Markup Language) file is used by PO for automated, reproducible data processing. YAML files are configuration files that define a series of arguments/flags. The details of the YAML files can be found in our PhytoOracle Automation repository.

Editing YAML file

For each season, YAML files must be edited to correctly process data for the respective season.

Specifically, the following keys should be edited for each season:

Examples of YAMLs for each season can be found here.

Updating GitHub Repositories

PhytoOracle Data

At the start of a new season, the phytooracle_data repository must be updated. Specifically, the season_config_yaml variable needs to be updated.

The season is defined by multiple keys, including name, start_date, end_date, flir_temp_units, and complete_field_dates (Figure 10).

Below are some details for each key:

Figure 10: Section of the season_config_yaml variable in the phytooracle_data GitHub repository.
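
A quick consistency check on a new season entry can catch typos before the repository is updated. The sketch below is only an illustration: the key names come from the list above, but the entry is represented as a plain Python dictionary, `complete_field_dates` is assumed to be a flat list of ISO dates, and every value is made up. The real structure in phytooracle_data may differ.

```python
from datetime import date

REQUIRED_KEYS = {
    "name", "start_date", "end_date", "flir_temp_units", "complete_field_dates",
}


def check_season(season):
    """Return a list of problems found in one season entry (empty if none)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - season.keys())]
    if problems:
        return problems
    start = date.fromisoformat(season["start_date"])
    end = date.fromisoformat(season["end_date"])
    if start > end:
        problems.append("start_date is after end_date")
    for day in season["complete_field_dates"]:
        if not start <= date.fromisoformat(day) <= end:
            problems.append(f"complete_field_dates entry outside season: {day}")
    return problems


# Hypothetical entry; every value below is illustrative.
season = {
    "name": "season_example",
    "start_date": "2022-11-15",
    "end_date": "2023-03-15",
    "flir_temp_units": "K",
    "complete_field_dates": ["2022-12-01", "2023-04-01"],
}
problems = check_season(season)
```

Here the second date falls after `end_date`, so `problems` contains one entry describing it.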

To get a list of RGB dates, use the iRODS ils command to list the directory of the respective season (Figure 11).

Figure 11: Getting RGB dates using iRODS.
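
If the listing is long, the dates can be pulled out of the `ils` output programmatically. The sketch below assumes that each scan date appears as a sub-collection (a line beginning with `C- `) whose name contains a `YYYY-MM-DD` date; that naming convention, the `scan_dates` helper, and the sample path are assumptions for illustration.

```python
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def scan_dates(ils_output):
    """Extract sorted, unique YYYY-MM-DD dates from sub-collection lines."""
    dates = set()
    for line in ils_output.splitlines():
        line = line.strip()
        if not line.startswith("C- "):
            continue  # skip the header line and plain data objects
        name = line[3:].rstrip("/").rsplit("/", 1)[-1]
        match = DATE_RE.search(name)
        if match:
            dates.add(match.group())
    return sorted(dates)


# Hypothetical listing; the path and entries are placeholders.
sample = """\
/zone/home/project/season_example/stereoTop:
  notes_2020-01-01.txt
  C- /zone/home/project/season_example/stereoTop/2020-02-12
  C- /zone/home/project/season_example/stereoTop/stereoTop-2020-01-30__12-00-00-000
  C- /zone/home/project/season_example/stereoTop/2020-02-12
"""
rgb_dates = scan_dates(sample)
```

On a machine with the iRODS icommands configured, the text passed to `scan_dates` could come from `subprocess.run(["ils", path], capture_output=True, text=True, check=True).stdout`.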

PhytoOracle Landmark Selection

The PhytoOracle 3d_landmark_selection container includes the phytooracle_data repository. As such, the 3d_landmark_selection container must be rebuilt once the abovementioned changes have been made to the phytooracle_data repository. To rebuild it, go to the container's page on DockerHub and click “Trigger” for the “latest” container (Figure 12).

Figure 12: Rebuilding the 3d_landmark_selection container on DockerHub.

Intro to High Performance Computers

UArizona High Performance Computing Cluster

The University of Arizona maintains an HPC center, which houses three compute resources: El Gato, Ocelote, and Puma.

Compute System

Name | El Gato | Ocelote | Puma
Model | IBM System X iDataPlex dx360 M4 | Lenovo NeXtScale nx360 M5 | Penguin Altus XE2242
Node Count | 131 | 400 | 236 CPU-only; 8 GPU; 2 High-memory
Total System Memory (TB) | 26TB | 82.6TB | 128TB
Processors | 2x Xeon E5-2650v2 8-core (Ivy Bridge) | 2x Xeon E5-2695v3 14-core (Haswell); 2x Xeon E5-2695v4 14-core (Broadwell); 4x Xeon E7-4850v2 12-core (Ivy Bridge) | 2x AMD EPYC 7642 48-core (Rome)
Cores / Node (schedulable) | 16c | 28c (48c - High-memory node) | 94c
Total Cores | 2160* | 11528* | 23616*
Processor Speed | 2.66GHz | 2.3GHz (2.4GHz - Broadwell CPUs) | 2.4GHz
Memory / Node | 256GB - GPU nodes; 64GB - CPU-only nodes | 192GB (2TB - High-memory node) | 512GB (3TB - High-memory nodes)
Accelerators | - | 46 NVIDIA P100 (16GB) | 29 NVIDIA V100S
/tmp | ~840 GB spinning; /tmp is part of root filesystem | ~840 GB spinning; /tmp is part of root filesystem | ~1440 TB NVMe /tmp
HPL Rmax (TFlop/s) | 46 | 382 | -
OS | CentOS 7 | CentOS 7 | CentOS 7

Compute Resources

The UArizona HPC provides three types of resources:

*Note: High priority is only available for the Puma cluster.

Running PhytoOracle on High Performance Computers

PhytoOracle is a scalable, modular phenomics data processing workflow manager. In short, this means that PhytoOracle can leverage high performance computing (HPC) clusters and cloud computing to distribute tasks across hundreds to thousands of cores.

Defining Compute Resources

Resources are defined in the workload_manager section of the PhytoOracle YAML. In this section, you can define many compute resource settings. Below is an example:
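
As a rough illustration of how such settings might map onto Slurm, the sketch below turns a dictionary of resource settings into standard `sbatch` options. The key names and values are hypothetical and are not the actual schema of the workload_manager section; only the `sbatch` options themselves (`--account`, `--partition`, `--job-name`, `--nodes`, `--ntasks`, `--time`, `--mem-per-cpu`) are standard Slurm. Consult the PhytoOracle Automation repository for the real keys.

```python
# Hypothetical settings; key names are illustrative, not PhytoOracle's schema.
workload_manager = {
    "account": "example_group",
    "partition": "standard",
    "job_name": "phytooracle_worker",
    "nodes": 1,
    "cores_per_worker": 94,   # schedulable cores on one Puma node
    "time_minutes": 720,
    "mem_per_core_gb": 5,
}


def sbatch_options(cfg):
    """Translate the settings above into standard Slurm sbatch options."""
    hours, minutes = divmod(cfg["time_minutes"], 60)
    return [
        f"--account={cfg['account']}",
        f"--partition={cfg['partition']}",
        f"--job-name={cfg['job_name']}",
        f"--nodes={cfg['nodes']}",
        f"--ntasks={cfg['cores_per_worker']}",
        f"--time={hours:02d}:{minutes:02d}:00",
        f"--mem-per-cpu={cfg['mem_per_core_gb']}G",
    ]


options = sbatch_options(workload_manager)
```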

Before Deploying PhytoOracle

There are a few things you must ensure before deploying PhytoOracle:

  - Confirm existence and accuracy of GCP file
    - Visually inspect on QGIS. Confirm correct placement of GCPs by overlaying the points with an RGB orthomosaic, either drone or gantry.
  - Confirm existence and accuracy of GeoJSON file
    - Visually inspect on QGIS. Check plot number sequence and genotype values.

If these steps are not followed, errors can propagate to multiple levels of data processing, requiring a reprocessing of data.
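
Part of this checklist can be automated before opening QGIS. The sketch below checks that both files exist, that every plot polygon has a genotype, and that plot numbers form a gap-free sequence. It assumes the GeoJSON stores integer plot numbers starting at 1 in a "plot" property, which is a hypothetical name; it complements the visual inspection and does not replace it.

```python
import json
import tempfile
from pathlib import Path


def preflight(gcp_path, geojson_path, plot_key="plot"):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    for path in (gcp_path, geojson_path):
        if not Path(path).is_file():
            problems.append(f"missing file: {path}")
    if problems:
        return problems

    features = json.loads(Path(geojson_path).read_text())["features"]
    plots = []
    for feature in features:
        props = feature["properties"]
        if props.get(plot_key) is None:
            problems.append("feature without a plot number")
        else:
            plots.append(props[plot_key])
        if not props.get("genotype"):
            problems.append(f"plot {props.get(plot_key)} has no genotype")
    # Assumes integer plot numbers that start at 1 and have no gaps.
    if sorted(plots) != list(range(1, len(plots) + 1)):
        problems.append("plot numbers are not a gap-free sequence starting at 1")
    return problems


# Demonstration with throwaway files; contents are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    gcp = Path(tmp, "gcp.txt")
    gcp.write_text("1 33.07470 -111.975\n")
    geo = Path(tmp, "plots.geojson")
    geo.write_text(json.dumps({"type": "FeatureCollection", "features": [
        {"type": "Feature", "properties": {"plot": 1, "genotype": "A"}, "geometry": None},
        {"type": "Feature", "properties": {"plot": 3, "genotype": ""}, "geometry": None},
    ]}))
    report = preflight(gcp, geo)
    missing_report = preflight(Path(tmp, "absent.txt"), geo)
```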

Supported Data Types

The Field Scanner collects two-dimensional (2D) and three-dimensional (3D) data types, including scanner3DTop (3D), stereoTop (RGB), ps2Top (fluorescence), and flirIrCamera (thermal) (Figure 13).

Figure 13: Data types collected by the Field Scanner. Two-dimensional (2D) data types include RGB, fluorescence, and thermal images, while three-dimensional (3D) include 3D point clouds.

2D Field Scanner Data Types

The 2D data collected by the Field Scanner include stereoTop (RGB), flirIrCamera (thermal), and ps2Top (fluorescence). These data are processed relatively quickly because they are much smaller than three-dimensional (3D) data. Processing of 2D data types is fully developed for both lettuce and sorghum (Figure 14).

Figure 14: Visualization of 2D data processing by PhytoOracle.

3D Field Scanner Data Types

The major goal of the PhytoOracle project is to phenotype individual plants at a high spatial-temporal scale. To accomplish this, individual plant positioning information (GPS coordinates) collected during 2D data processing is leveraged to extract individual plant data from the 3D data (Figure 15).

Figure 15: Visualization of 3D data processing by PhytoOracle.

As such, much focus has been placed on 3D point cloud data. These data undergo intensive processing to extract individual plant point clouds (Figure 16).

Figure 16: Individual plant point clouds processed by PhytoOracle.

Deploying PhytoOracle

After (i) checking the GCP and GeoJSON files (see Before Deploying PhytoOracle) and (ii) generating a YAML file (see Editing YAML file), you are ready to run PhytoOracle.

PhytoOracle is made up of multiple workflows to process two-dimensional (2D) and three-dimensional (3D) data (Figure 17). These workflows allow for automated, scalable processing of raw data collected by the Field Scanner. The data processing results in high spatial-temporal phenotype information.

Figure 17: PhytoOracle workflows for processing raw data collected by the Field Scanner.

PhytoOracle is mainly deployed on the UArizona HPC. The next sections provide a brief description of how to run each workflow. For additional details, please refer to the PhytoOracle publication. In all cases, the commands provided will automatically handle all steps of processing, including:

stereoTop

The stereoTop workflow runs image stitching and plant detection, resulting in the extraction of bounding area and GPS coordinate information for each plant. The workflow is run as follows:

sbatch shell_scripts/slurm_submission_large.sh <yaml_file>
For example, if you wanted to run this for season 15:
sbatch shell_scripts/slurm_submission_large.sh yaml_files/season_15/stereoTop_level01_s15.yaml

flirIrCamera

The flirIrCamera workflow runs image stitching and plant detection, resulting in the extraction of canopy temperature and GPS coordinate information for each plant. The workflow is run as follows:
sbatch shell_scripts/slurm_submission_large.sh <yaml_file>
For example, if you wanted to run this for season 15:
sbatch shell_scripts/slurm_submission_large.sh yaml_files/season_15/flirIrCamera_level01_s15.yaml

ps2Top

The ps2Top workflow applies a threshold to fluorescence plot-centered images, resulting in the extraction of maximum potential quantum efficiency of Photosystem II (Fv/Fm).
sbatch shell_scripts/slurm_submission_large.sh <yaml_file>
For example, if you wanted to run this for season 15:
sbatch shell_scripts/slurm_submission_large.sh yaml_files/season_15/ps2Top_level01_s15.yaml
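
For reference, Fv/Fm is conventionally computed as (Fm - F0) / Fm, where F0 is the minimum and Fm the maximum fluorescence. The sketch below applies a simple threshold to keep plant pixels and averages the ratio over them. It illustrates the formula only: the thresholding rule, the `mean_fv_fm` helper, and the pixel values are assumptions, not the pipeline's actual implementation.

```python
def mean_fv_fm(f0, fm, threshold):
    """Mean Fv/Fm over pixels whose Fm exceeds `threshold`.

    f0 and fm are equally sized sequences of minimum and maximum
    fluorescence per pixel; Fv/Fm = (Fm - F0) / Fm. The threshold is
    expected to be non-negative, so retained pixels have Fm > 0.
    """
    ratios = [(m - z) / m for z, m in zip(f0, fm) if m > threshold]
    if not ratios:
        return None  # no pixel passed the threshold
    return sum(ratios) / len(ratios)


# Illustrative pixel values only; the last pixel is treated as background.
f0 = [200.0, 250.0, 40.0]
fm = [1000.0, 1000.0, 50.0]
value = mean_fv_fm(f0, fm, threshold=100.0)
```

With these values, the two retained pixels have ratios of 0.8 and 0.75, so `value` is approximately 0.775.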

scanner3DTop

The scanner3DTop workflow runs point cloud stitching and leverages GPS coordinates collected during stereoTop processing, resulting in the extraction of traditional and topological shape descriptors for each plant. This workflow involves multiple levels of processing, including:

For example, if you wanted to run level 1 processing for season 15:
sbatch shell_scripts/slurm_submission.sh yaml_files/season_15/scanner3DTop_level01_s15.yaml
To run level 2 processing for season 15:
sbatch shell_scripts/slurm_submission.sh yaml_files/season_15/scanner3DTop_level02_s15.yaml

*Note: scanner3DTop level 1 and 2 processing uses shell_scripts/slurm_submission.sh instead of shell_scripts/slurm_submission_large.sh. This is because the manager node performs no processing; it only creates the tasks and sends them to worker nodes. As such, the manager node requires only two cores instead of 94.
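
The choice of submission script and YAML path can be captured in a small helper. The sketch below assumes that other seasons and levels follow the same file naming pattern as the season 15 examples above; that pattern and the `submission_command` helper are inferred for illustration and are not part of PhytoOracle.

```python
def submission_command(sensor, level, season):
    """Build the sbatch command for one sensor, level, and season number."""
    if sensor == "scanner3DTop":
        # The manager node does no processing, so the smaller script is used.
        script = "shell_scripts/slurm_submission.sh"
    else:
        script = "shell_scripts/slurm_submission_large.sh"
    yaml_file = (
        f"yaml_files/season_{season}/{sensor}_level{level:02d}_s{season}.yaml"
    )
    return ["sbatch", script, yaml_file]


cmd = submission_command("scanner3DTop", 2, 15)
print(" ".join(cmd))
# prints: sbatch shell_scripts/slurm_submission.sh yaml_files/season_15/scanner3DTop_level02_s15.yaml
```

The returned list can be passed directly to `subprocess.run` from the repository root on the HPC login node.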

Quality Control & Quality Assurance of PhytoOracle Processed Data

Although PhytoOracle is reproducible due to the use of containers and YAML configuration files, it is important to follow quality control (QC) and quality assurance (QA) steps after data processing. The recommended steps for this are:

If any errors are spotted during these QA/QC steps, immediately notify the project lead. Depending on the impact of the error, data may need to be reprocessed to ensure data integrity.